Abstract

Two methods for estimating measures of pass-fail reliability are 
derived. The methods require only a single test administration and are 
computationally simple. Both are based on the Spearman-Brown formula for 
estimating stepped-up reliability. The non-distributional method requires 
only that the test be divisible into parallel half-tests; the normal method 
makes the additional assumption of normally distributed test scores. Bias for 
the two procedures is investigated by simulation. For nearly normal test 
score distributions, the normal method performs slightly better than the non- 
distributional method, but for moderately to severely skewed or symmetric 
platykurtic test score distributions the non-distributional method is 
superior. Test results from a licensure examination are used to illustrate 
the methods. 



KEY WORDS: Cohen's kappa, licensure examination, pass-fail reliability,
Spearman-Brown formula.

Introduction

A primary component of the Standards for Educational and Psychological 
Testing (APA, 1985) with respect to licensure and certification examinations 
requires test publishers to report the reliability of pass-fail decisions 
(hereafter referred to as PF reliability). Hambleton and Novick (1973)
proposed $\theta$, the proportion of consistently classified examinees, as a
measure of PF reliability. Swaminathan, Hambleton, and Algina (1974)
suggested that Cohen's (1960) kappa coefficient, denoted by $\kappa$, be used in
place of $\theta$. Coefficient $\kappa$ is the proportion of consistently classified
examinees, corrected for chance. Though it is commonly thought of as a
measure of association rather than of agreement, $\phi$, the Pearson correlation
between two dichotomous variables, equals $\kappa$ under certain circumstances that
will be discussed. Thus, $\phi$ may also be used as a measure of PF reliability.

If two parallel test forms are available for administration to the same 
sample of examinees, then estimates for $\theta$ and $\kappa$ are easily obtained by the
method of moments. If only one form of the test may be administered, then
obtaining estimates for $\theta$ and $\kappa$ becomes much more difficult, both
theoretically and computationally. Huynh (1976) developed a procedure for
estimating $\theta$ and $\kappa$ which is based on a beta-binomial model and requires only
one test administration. The computations involved are quite intricate, but
Huynh (1976) also suggested a simpler method based on a normal
approximation. Peng and Subkoviak (1980) further simplified Huynh's (1976)
approximate method and presented evidence suggesting that their simplified
procedure is superior to Huynh's. Brennan (1981) supplied tables which make
the computations for Peng and Subkoviak's (1980) procedure relatively
simple. Subkoviak (1980) discussed several other methods for
estimating $\theta$ and $\kappa$ when only one test form is available.

The purpose of this paper is to derive and illustrate two theoretically
and computationally simple methods by which both $\theta$ and $\kappa$ may be estimated
from a single test administration, when the test is divisible into parallel
half-tests. One of the methods is based on normal theory; the other makes only
minimal distributional assumptions. Bias for the two procedures is evaluated
under a variety of test score distributions and test reliabilities using
simulation techniques.

Derivation of the Methods 

Let X denote the total test and Y1 and Y2 the parallel half-tests (Lord
and Novick, 1968) into which X is divisible. As will be seen later, the
statistical assumptions defining parallelism for Y1 and Y2 may be relaxed so
long as Y1 and Y2 are parallel (homogeneous) in content. Let A denote the
dichotomous variable that equals 0 when an examinee fails X and equals 1 when
an examinee passes X. The dichotomous variables B1 and B2 are similarly
defined for Y1 and Y2. The three variables A, B1, and B2 require that passing
scores be set for X, Y1, and Y2. The passing score for X is usually
determined, at least in part, from criterion information distinct from the
pass rate. It is assumed, however, that the passing scores for Y1 and Y2 are
determined so that the pass rates for Y1 and Y2 are identical to that of X.

The proportion parameters describing the variables B1 and B2 may be
expressed in the usual format of a 2 by 2 table as follows:

                          Value of B2
    Value of B1        0             1           Total

        0          $\pi_{00}$    $\pi_{01}$        q
        1          $\pi_{10}$    $\pi_{11}$        p

      Total            q             p             1


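In this table $\pi_{00}$ is the proportion of examinees in the population of
interest who fail both Y1 and Y2, and $\pi_{01}$, $\pi_{10}$, and $\pi_{11}$ are defined
analogously. Because the pass rate (p) is the same for B1 and B2,
$\pi_{01} = \pi_{10}$. In terms of these proportion parameters,

$$\phi = (\pi_{00}\pi_{11} - \pi_{01}\pi_{10})/(pq), \qquad (1)$$

$$\theta = \pi_{00} + \pi_{11}, \qquad (2)$$

$$\kappa = (\theta - \theta_c)/(1 - \theta_c), \qquad (3)$$

where $\theta_c = p^2 + q^2$. The parameter $\theta_c$ is the value of $\theta$ when B1 and B2 are
statistically independent. Given the above assumption that the pass rate is
the same for B1 and B2, one can show that $\kappa = \phi$, as first noted by Cohen
(1960), who also stated that $\kappa$ and $\phi$ are nearly identical so long as the pass
rates differ by no more than .10. Indeed, with equal pass rates
$\pi_{00} = q - \pi_{01}$ and $\pi_{11} = p - \pi_{01}$, so (1) reduces to $\phi = 1 - \pi_{01}/(pq)$, while
substituting $\theta = 1 - 2\pi_{01}$ and $\theta_c = 1 - 2pq$ into (3) gives the same
expression for $\kappa$.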
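The three indices can be computed directly from a 2 by 2 table of proportions. The following is a minimal Python sketch, not part of the original paper; the function name `pf_indices` and the example proportions are hypothetical, and the chance-agreement term is written in Cohen's general form, which reduces to $p^2 + q^2$ when the two pass rates are equal.

```python
from math import isclose, sqrt


def pf_indices(p00, p01, p10, p11):
    """Return (theta, kappa, phi) for a 2x2 table of pass-fail proportions.

    p_jk is the proportion of examinees with B1 = j and B2 = k
    (0 = fail, 1 = pass).
    """
    if not isclose(p00 + p01 + p10 + p11, 1.0, abs_tol=1e-9):
        raise ValueError("the four proportions must sum to 1")
    pass1 = p10 + p11                      # pass rate on B1
    pass2 = p01 + p11                      # pass rate on B2
    theta = p00 + p11                      # consistent classifications
    theta_c = pass1 * pass2 + (1 - pass1) * (1 - pass2)  # chance agreement
    kappa = (theta - theta_c) / (1 - theta_c)
    phi = (p00 * p11 - p01 * p10) / sqrt(
        pass1 * (1 - pass1) * pass2 * (1 - pass2))
    return theta, kappa, phi


# Hypothetical table with equal pass rates (p = 0.90), so kappa equals phi.
theta, kappa, phi = pf_indices(0.07, 0.03, 0.03, 0.87)
print(round(theta, 4), round(kappa, 4), round(phi, 4))
```

With equal marginal pass rates the last two values agree, which is the equality between $\kappa$ and $\phi$ used throughout the derivation.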

Estimates for $\phi$, $\theta$, and $\kappa$ obtained by substituting observed proportions
into (1), (2), and (3) would pertain to decisions based on the half-tests Y1
and Y2 and not, as is desired, on the whole test X. Let Y1* and Y2* be the
doubled-in-length versions of Y1 and Y2, where the lengthening is in accordance
with the model of parallel measurements. Thus, Y1* and Y2* are parallel
forms of X. Let B1* and B2* be the dichotomizations of Y1* and Y2* under the
assumption that the passing scores for Y1* and Y2* are chosen so that the pass
rates are the same as for Y1, Y2, and X. Finally, let $\phi^* = \kappa^*$ and $\theta^*$ be the
PF reliability coefficients corresponding to B1* and B2* (and, consequently,
to A). Expressions for these coefficients will be derived below.
Calculations Using Normal Theory 

One simple way to estimate $\phi^*$ and $\theta^*$ is a straightforward modification of
the Huynh-Peng-Subkoviak (HPS) procedure to fit the present model of parallel
half-tests. One can drop the HPS beta-binomial model assumption for item
sampling, but keep the bivariate normal approximation for test scores.
Let $K_z = (K - \mu_X)/\sigma_X$, where K is the passing score on the total test,
and $\mu_X$ and $\sigma_X$ are the population mean and standard deviation for the total
test. Under the assumptions of the model, $q = P[Z \le K_z]$, where Z is a
standard normal random variable, and $p = 1 - q$. Furthermore

$$\pi^*_{00} = P[Z_1 \le K_z,\ Z_2 \le K_z], \qquad (4)$$

where $Z_1$ and $Z_2$ have a standard bivariate normal distribution with correlation
coefficient $\rho^*$.

To estimate $K_z$, one can replace $\mu_X$ and $\sigma_X$ with their corresponding sample
estimators. From the parallel half-scores Y1 and Y2, one can estimate the
half-test reliability, $\rho$, then step it up to an estimate, $r_{SB}$, of the
full-test reliability, $\rho^*$, using the Spearman-Brown formula. The estimates for
$K_z$ and $\rho^*$ can then be substituted in Equation 4 to estimate $\pi^*_{00}$. The tables in
Brennan (1981) may be used to look up values for the estimate of $\pi^*_{00}$, as may
other tables of the bivariate normal distribution function.

Estimates for the probabilities $\pi^*_{01}$ and $\pi^*_{11}$ can then be computed from
the relationships $\pi^*_{00} + \pi^*_{01} = q$ and $\pi^*_{11} + \pi^*_{01} = p$.

Since $\theta^* = \pi^*_{00} + \pi^*_{11}$ and $\phi^* = \kappa^* = 1 - \pi^*_{01}/(pq)$, the normal model estimates
for $\pi^*_{00}$, $\pi^*_{11}$, and $\pi^*_{01}$ can be immediately converted to estimates
for $\theta^*$ and $\phi^*$.
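The normal-theory calculation can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions rather than the authors' program: the function names are hypothetical, and the bivariate normal probability is obtained by Simpson's rule applied to the one-dimensional integral $\int_{-\infty}^{k} \varphi(z)\,\Phi\big((k - \rho z)/\sqrt{1-\rho^2}\big)\,dz$ instead of by table lookup.

```python
from math import erf, exp, pi, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def norm_pdf(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def bvn_lower(k, rho, steps=4000, lower=-10.0):
    """P[Z1 <= k, Z2 <= k] for a standard bivariate normal, correlation rho.

    Composite Simpson's rule on [lower, k]; steps must be even.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    if k <= lower:
        return 0.0
    scale = sqrt(1.0 - rho * rho)
    width = (k - lower) / steps

    def integrand(z):
        return norm_pdf(z) * norm_cdf((k - rho * z) / scale)

    total = integrand(lower) + integrand(k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * width)
    return total * width / 3.0


def normal_method(cut, mean, sd, r_half):
    """Normal-theory estimates (theta, phi) for the full test.

    cut, mean, sd describe the full test; r_half is the correlation
    between the two half-tests.
    """
    r_sb = 2.0 * r_half / (1.0 + r_half)   # Spearman-Brown step-up
    k = (cut - mean) / sd                  # standardized passing score
    q = norm_cdf(k)
    p = 1.0 - q
    pi00 = bvn_lower(k, r_sb)
    pi01 = q - pi00
    pi11 = p - pi01
    return pi00 + pi11, 1.0 - pi01 / (p * q)


# Hypothetical full test: passing score at the mean, half-test r = 0.50.
theta_n, phi_n = normal_method(cut=200.0, mean=200.0, sd=30.0, r_half=0.50)
print(round(theta_n, 4), round(phi_n, 4))
```

When the passing score sits at the mean, the orthant probability has the closed form $1/4 + \arcsin(\rho)/(2\pi)$, which gives a convenient check on the numerical integration.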

The practical difference between the method proposed here and the HPS
method is the use of the stepped-up reliability estimate, $r_{SB}$, in place of
KR21. The basic theoretical difference is that the method proposed here
relies on a parallel half-test model rather than on a beta-binomial model for
the full test.

A Non-Distributional Method

In the normal theory model above, the half-length reliability, $\rho$, was
stepped up to the full-length reliability, $\rho^*$, by the Spearman-Brown
formula. Alternatively, one could step up the half-length PF
reliability, $\phi$, directly: $\phi_{SB} = 2\phi/(1+\phi)$. Substituting the expression
for $\phi$ given in (1) into this expression for $\phi_{SB}$ and simplifying yields

$$\phi_{SB} = 1 - \frac{\pi_{01}}{2pq - \pi_{01}}, \qquad (5)$$

where $p = \pi_{01} + \pi_{11}$ is the pass rate and $q = 1 - p$. Because
$\phi^* = 1 - \pi^*_{01}/(pq)$, it follows that

$$\phi_{SB} - \phi^* = \frac{\pi^*_{01}}{pq} - \frac{\pi_{01}}{2pq - \pi_{01}}. \qquad (6)$$

(Note that the pass rate $p = \pi_{01} + \pi_{11} = \pi^*_{01} + \pi^*_{11}$ is the same for both the
half-length and the full-length tests.) For non-zero $\pi_{01}$, the left side
difference in (6) is zero if and only if $\pi^*_{01} = [1/(1 + \phi)]\pi_{01}$. However,
because $0 < \pi^*_{01} \le \pi_{01}$, the first term of the right side difference is
greater than 0 and no greater than $\pi_{01}/(pq)$, and this leads to the following
upper and lower bounds for the left side difference in (6):

$$-\frac{\pi_{01}}{2pq - \pi_{01}} < \phi_{SB} - \phi^* \le \frac{\pi_{01}}{pq} - \frac{\pi_{01}}{2pq - \pi_{01}}. \qquad (7)$$

Since each side of this inequality approaches 0 as $\pi_{01}$ approaches 0, $\phi_{SB}$
becomes a better approximation to $\phi^*$ as the half-test reliability increases,
though, as will be argued, $\phi_{SB}$ can be a useful approximation to $\phi^*$ when the
half-test reliability is only moderate.

Technically, $\phi_{SB}$ is the reliability of a test composed of the two
dichotomously scored parts B1 and B2 or, equivalently, the correlation between
two parallel forms of such a test. These test scores would be trichotomous
variables taking the values 0, 1, and 2. Consequently, the interpretation
of $\phi_{SB}$ as the correlation between B1* and B2* is an approximation,
because B1* and B2* are dichotomous variables.
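The stepped-up coefficient and the full-test proportions it implies can be sketched as follows. This is an illustrative Python fragment, not code from the paper: the function name and the input values are hypothetical, and the conversion from $\phi_{SB}$ to a full-test off-diagonal proportion and to a $\theta$ estimate simply reuses the relationships $\phi = 1 - \pi_{01}/(pq)$ and $\theta = 1 - 2\pi_{01}$ that hold when the two pass rates are equal.

```python
def sb_pf_reliability(pi01, p):
    """Non-distributional full-test estimates from a symmetric half-test table.

    pi01 is the common off-diagonal proportion (pass one half, fail the
    other) and p is the common pass rate of the two half-tests.
    """
    q = 1.0 - p
    phi_half = 1.0 - pi01 / (p * q)                 # half-test phi = kappa
    phi_sb = 2.0 * phi_half / (1.0 + phi_half)      # Spearman-Brown step-up
    pi01_full = (1.0 - phi_sb) * p * q              # implied full-test cell
    theta_sb = 1.0 - 2.0 * pi01_full                # implied consistency
    return {"phi_half": phi_half, "phi_sb": phi_sb,
            "pi01_full": pi01_full, "theta_sb": theta_sb}


# Hypothetical half-test table: pass rate 0.90, off-diagonal cells 0.03 each.
est = sb_pf_reliability(pi01=0.03, p=0.90)

# Equation (5) gives the same stepped-up value without computing phi first.
phi_sb_eq5 = 1.0 - 0.03 / (2.0 * 0.90 * 0.10 - 0.03)
print(round(est["phi_sb"], 4), round(phi_sb_eq5, 4), round(est["theta_sb"], 4))
```

For these inputs the half-test coefficient of about .67 steps up to .80, and the implied full-test off-diagonal proportion is .018.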

More specifically, let C = B1 + B2 and C' = B1' + B2', where the prime
denotes a parallel measurement. The coefficient $\phi_{SB}$ equals the correlation
between the two parallel measurements, C and C', both of which may take the
values of 0, 1, or 2. For the measurement C, the value 0 occurs if the
examinee fails both B1 and B2. The value 1 occurs if the examinee passes one
but fails the other. If the examinee passes both B1 and B2, then C takes the
value of 2. The values of C' are similarly defined in terms of B1' and B2'.

If B1 and B2 are reasonably reliable, then there should be relatively few
examinees with scores of 1 on C and C'. Assume that the group of examinees
scoring 1 on C is approximately the same as the group of examinees with scores
of 1 on C'. Let the dichotomous variables D and D' be defined by dividing

to $\pi_{01}$ and $\pi_{10}$ are minor in relation to the sample size, then $p_{01}$ and $p_{10}$ may
be replaced by their average, and the marginal proportions adjusted
accordingly. The above formulas are then applied to the modified 2x2 table
of observed proportions.
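The smoothing step can be illustrated with a short Python sketch. The function name and the counts below are hypothetical; the only operation taken from the text is the replacement of the two observed off-diagonal proportions by their average, after which the common pass rate is recomputed from the modified table.

```python
def smooth_half_test_table(n00, n01, n10, n11):
    """Turn observed counts into a symmetric 2x2 table of proportions.

    n_jk is the number of examinees with B1 = j and B2 = k. The two
    off-diagonal proportions are replaced by their average, which makes
    the two marginal pass rates equal.
    """
    n = n00 + n01 + n10 + n11
    p00, p01, p10, p11 = n00 / n, n01 / n, n10 / n, n11 / n
    off = (p01 + p10) / 2.0
    pass_rate = p11 + off          # common pass rate after smoothing
    return p00, off, off, p11, pass_rate


# Hypothetical counts for 500 examinees.
s00, s01, s10, s11, p = smooth_half_test_table(40, 25, 15, 420)
phi = 1.0 - s01 / (p * (1.0 - p))          # half-test phi from smoothed table
phi_sb = 2.0 * phi / (1.0 + phi)           # stepped-up estimate
print(round(s01, 3), round(p, 3), round(phi_sb, 3))
```

Averaging leaves the diagonal cells and the total untouched, so the smoothed proportions still sum to one.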

Simulation Results 

A Monte Carlo investigation was undertaken to evaluate the accuracy of
the non-distributional procedure as well as the normal procedure. An
important application of PF reliability indices is in the area of
certification and licensure examinations, where there are usually at least
several hundred and oftentimes many thousand examinees taking the test.
Hence, it was considered to be more important to investigate the bias of the
procedures for large sample sizes rather than to compare the small-sample
standard errors of the procedures.

The present simulations were undertaken on an IBM 4381 mainframe using
SAS version 5 (SAS Institute Inc., 1985), except that the IMSL (IMSL, 1987)
function BNRDF was used to evaluate the bivariate normal cumulative
distribution function for the normal method. Six simulation situations were
considered. Within each of three different test score distribution shapes
(nearly normal, platykurtic, and negatively skewed), two full-test reliabilities
were considered: .92 and .71, where the full-test reliability is defined as
the Spearman-Brown stepped-up correlation between the half-tests. For each of
the six simulation situations, two replications were done, where one
replication consists of the generation of four half-test scores for 20,000
examinees under the following model:

$$Y_{ij} = \gamma T_i + \delta E_{ij}, \qquad i = 1, \ldots, 20{,}000 \text{ and } j = 1, 2, 3, 4,$$

where $\gamma$ and $\delta$ are parameters and $T_i$ and all $\{E_{ij}\}$ are independent variates
generated from the standard normal distribution. All half-test scores were
rounded to integer values, and the full-test scores were computed as
X1 = Y1 + Y2 and X2 = Y3 + Y4. Various degrees of symmetrical and
asymmetrical truncation on T and, to a much lesser extent, symmetrical
truncation on the E's were used to control the distributional shapes of the
test scores being generated. Formulas for the mean and variance of truncated
normal variables are available (Johnson & Kotz, 1970), and these in
combination with the $\gamma$ and $\delta$ parameters permitted some control over the means,
variances, and reliabilities of the test scores.

Full-test characteristics for the six simulation situations are presented
in Table 1. These situations were chosen as representative of those
encountered in practice. Test score distributions for licensure,
certification, and various other selection examinations are frequently but not
exclusively found to be negatively skewed, as is illustrated by an example
presented later. Alternatively, Lord (1955) found for professionally
constructed educational examinations that if they were symmetric they tended
to be platykurtic (flat). Lord's (1955) finding was reaffirmed with a random
sample of 40,000 examinees from a recent administration of the ACT Assessment
examination. The distributions of raw scores for the four subtests comprising
the ACT Assessment were approximately symmetric with skews ranging from -.2 to
+.25 but with kurtoses ranging from -.40 to -.96. Finally, since test scores
are inherently bounded, rather than consider an exactly normal distribution
situation, a nearly normal situation with slight platykurtosis was used
instead. Though full-test reliabilities for professionally constructed
examinations are usually in the nineties or at least the eighties, subtest
reliabilities may be lower, and so both high and low reliabilities, represented
by .92 and .71, were considered.

Failure rates of ten and thirty percent were selected for investigation
as they seemed to represent a realistic range. Due to the integer nature of
the generated test scores, it was not always possible to achieve exactly ten
or thirty percent failure rates for the full-tests. Rather, the failure rates
ranged from 8.5% to 11% and from 29% to 32.5% across the six situations.

For an estimator T of some parameter $\psi$, bias is defined as $E(T) - \psi$.
The parameters of interest are the PF reliability indices $\theta$ and $\phi = \kappa$. For
the distributions modeled, the theoretical values of these parameters are not
known. However, the simulations included the generation of two full-test
scores for all simulated examinees, and consistent estimates for the
parameters were obtained by applying the method of moments (MOM) to the 2 by 2
table derived from the pairs of full-test scores. With an N of 20,000 these
consistent MOM estimates should, for practical purposes, accurately reflect
the true parameter values. In what follows, those estimates are denoted by a
caret, but have no subscript. The two estimation methods compared were the
normal and non-distributional methods, both of which are computed for the first
full-test score X1 = Y1 + Y2 only. Normal method estimates have N as a
subscript while the non-distributional method estimates have SB as a
subscript. The bias for each method is estimated as the difference between
its estimate and the consistent MOM estimate. This approach of using the MOM
estimate as the parameter is similar to that used by Peng and Subkoviak
(1980), Huynh and Sanders (1980), and Subkoviak (1978).
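The bias computation described above can be mimicked with a small simulation. The sketch below is written in Python under simplifying assumptions that are not taken from the paper: the generating model is an untruncated linear model with an added constant `mean`, the parameter values (`mean=80`, `gamma=10`, `delta=5`) are hypothetical, and only the non-distributional method is evaluated. It generates four half-test scores per examinee, treats the MOM estimates from the two full-test scores as the parameter values, and reports the difference between the stepped-up estimates and those values.

```python
import random
from bisect import bisect_left


def simulate_half_tests(n, mean, gamma, delta, seed):
    """Draw four integer half-test scores per examinee (illustrative model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.gauss(0.0, 1.0)  # examinee-level variate shared by all halves
        rows.append([round(mean + gamma * t + delta * rng.gauss(0.0, 1.0))
                     for _ in range(4)])
    return rows


def cut_for_fail_rate(scores, target):
    """Integer passing score whose failure rate P[score < cut] is nearest target."""
    ordered = sorted(scores)
    n = len(ordered)
    return min(range(ordered[0], ordered[-1] + 2),
               key=lambda c: abs(bisect_left(ordered, c) / n - target))


def pass_fail_table(a, b, cut_a, cut_b):
    """Proportions (p00, p01, p10, p11); first index refers to a, second to b."""
    counts = [[0, 0], [0, 0]]
    for s_a, s_b in zip(a, b):
        counts[int(s_a >= cut_a)][int(s_b >= cut_b)] += 1
    n = len(a)
    return (counts[0][0] / n, counts[0][1] / n,
            counts[1][0] / n, counts[1][1] / n)


def summarize(p00, p01, p10, p11):
    """Return (theta, phi, p) after averaging the two off-diagonal cells."""
    off = (p01 + p10) / 2.0
    p = p11 + off
    return 1.0 - 2.0 * off, 1.0 - off / (p * (1.0 - p)), p


rows = simulate_half_tests(n=20000, mean=80.0, gamma=10.0, delta=5.0, seed=1)
y1, y2, y3, y4 = (list(col) for col in zip(*rows))
x1 = [a + b for a, b in zip(y1, y2)]
x2 = [a + b for a, b in zip(y3, y4)]

# Full-test cut near a 10% failure rate; half-test cuts matched to that rate.
cut_x = cut_for_fail_rate(x1, 0.10)
fail_x = bisect_left(sorted(x1), cut_x) / len(x1)
cut_y1 = cut_for_fail_rate(y1, fail_x)
cut_y2 = cut_for_fail_rate(y2, fail_x)

# "Parameter" values: method of moments on the two full-test scores.
theta_mom, phi_mom, _ = summarize(*pass_fail_table(x1, x2, cut_x, cut_x))

# Non-distributional estimates from the first pair of half-tests only.
_, phi_half, p = summarize(*pass_fail_table(y1, y2, cut_y1, cut_y2))
phi_sb = 2.0 * phi_half / (1.0 + phi_half)
theta_sb = 1.0 - 2.0 * (1.0 - phi_sb) * p * (1.0 - p)

bias_theta = theta_sb - theta_mom
bias_phi = phi_sb - phi_mom
print(f"theta: MOM {theta_mom:.3f}, SB bias {bias_theta:+.3f}")
print(f"phi:   MOM {phi_mom:.3f}, SB bias {bias_phi:+.3f}")
```

Changing `gamma` and `delta` alters the half-test reliability, and replacing the normal draws with truncated ones would be one way to approximate the skewed and platykurtic situations; neither variation is claimed to reproduce the settings behind Table 1.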

Table 2 presents MOM estimates for the PF reliability indices
$\theta$ and $\phi = \kappa$ as well as estimated biases for the normal and non-distributional
methods. Results are presented for two replications under all
six simulation situations and for approximate fail rates of ten and thirty
percent.

The replications in Table 2 reveal some variability in the bias estimates
even with an N of 20,000. Despite this variability, clear patterns do
emerge. Focusing first on $\theta$, it can be seen that for the two nearly normal
situations the normal method is never significantly worse and in one case
appreciably better than the non-distributional method, though the bias for
both methods is modest. With the four non-normal situations, the pattern is
reversed. The non-distributional method is never substantially worse and
usually considerably better than the normal method, but again, both methods
usually show only modest bias.

Turning next to $\phi$, the pattern is similar but the biases are generally
larger, the latter result having also been observed by Peng and Subkoviak
(1980) and Huynh and Sanders (1980) with their methods. For the two nearly
normal situations, the normal method is appreciably better than the
non-distributional method, though the latter method performs reasonably well.
With the four non-normal situations, the non-distributional method usually
performs fairly well and is considerably better than the normal method, which
has rather large bias when the fail rate is 10%.

It is interesting to note that while the normal method sometimes yields
positive and sometimes negative bias estimates, the biases in Table 2 are
always positive for the non-distributional method. In the derivation of the
non-distributional method, it was suggested that insofar as the method may be
biased, the bias would be positive and attributable to attenuation due to
grouping. It is also worthwhile to note that previous simulation studies by
Peng and Subkoviak (1980) and Huynh and Sanders (1980) found that the HPS
method and Huynh's beta-binomial method had biases similar in magnitude to
those found for the present methods, though the previous studies concentrated
on short tests while the focus of the present study is long tests. However,
Huynh's beta-binomial model was applied to the current simulated data, but the
results are not reported because, as was expected, its performance was very
similar to the performance of the normal method. (In applying the
beta-binomial method, the number of items on each test was chosen so that KR21
was close in value to $r_{SB}$, with the constraint that test length could never be
less than the maximum observed score.)

In summary, neither method shows large bias when estimating $\theta$, though the
normal method generally shows less bias than the non-distributional method
when the test scores are approximately normally distributed, while the opposite
holds when they are not. In estimating $\phi$, the non-distributional method
generally shows mild to moderate positive bias, but is considerably less biased
than the normal method when the test scores are not normally distributed, with
the reverse being true when they are. These results indicate that when the
sample size is large and the test score distribution shows substantial
departures from normality, the non-distributional method should yield more
accurate estimates of $\theta$ and especially $\phi$ than the normal method.

In the next section, the methods are illustrated with data from a 
licensure examination. 

An Illustrative Example

The data used here are from a licensure examination containing 300 scored
items. The test is divided into two separately timed parts consisting of 150
scored items each. The two parts were constructed to be equally difficult
based on field test data and were matched in content according to the test's
table of specifications. A group of approximately twenty expert judges rated
the 300 scored items using the Angoff (1971) method. The judges also rated
what proportion of items a minimally competent examinee should answer
correctly in each of the many content areas covered by the test. A passing
score for the total test of at least 200 items correct was determined from a
weighted average of the judges' item and area ratings.

The method requires that passing scores be determined for the two parts. 
From a strictly statistical perspective, these passing scores should be chosen 
so that the passing rates on the two parts are equal to each other and to the 
percentage passing on the full test. If a representative sample of examinees 
is available, then the half-test passing scores may be determined solely from 
the passing rates. The half-test passing scores need not be taken as one-half 
of the full test passing score, nor do the half-test passing scores have to
sum to the full test passing score. In general, these last two conditions 
will not be fulfilled when the half-test passing scores are determined by 
equating the passing rates. 
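Choosing a half-test passing score by equating pass rates amounts to a search over integer cut scores. The following Python sketch is only an illustration: the function name and the ten-examinee score lists are hypothetical, and ties among equally good cuts are resolved toward the lowest cut, which is a design choice made here rather than a rule stated in the text.

```python
from bisect import bisect_left


def cut_for_pass_rate(scores, target_pass_rate):
    """Integer cut whose pass rate P[score >= cut] is closest to the target.

    When several cuts are equally close, the lowest one is returned.
    """
    ordered = sorted(scores)
    n = len(ordered)

    def gap(cut):
        pass_rate = (n - bisect_left(ordered, cut)) / n
        return abs(pass_rate - target_pass_rate)

    return min(range(ordered[0], ordered[-1] + 2), key=gap)


# Hypothetical part scores for ten examinees; the full-test pass rate is 0.70.
part_one = [88, 95, 101, 104, 107, 110, 112, 118, 121, 130]
part_two = [85, 90, 97, 99, 103, 105, 109, 113, 117, 126]
cut_one = cut_for_pass_rate(part_one, 0.70)
cut_two = cut_for_pass_rate(part_two, 0.70)
print(cut_one, cut_two)
```

Because the two parts can differ in difficulty, the two cuts found this way need not be equal, and they need not sum to the full-test passing score.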

In many applications, it may be possible to integrate psychometric and 
statistical considerations. If criterion data, such as expert judges' 
ratings, are available, then these data may also be employed in determining 
the half-test passing scores. Consider the present example: In rating the 
items and areas, the judges' only concern was to establish a passing score for
the total test which would determine whether or not an examinee should be
licensed. However, using the same weights as were used to determine the
passing score for the total test, the item and area ratings were also used to
determine passing scores for the two parts. Both parts received 100 items
correct as passing scores based on the expert judges' ratings. After the test
was administered and the results analyzed, these passing scores were changed
to 102 for part one and 98 for part two. The reason for the change was that
in the total group of examinees the average score for part one was
approximately four points higher than that for part two, and with these
adjusted passing scores the part one and part two passing rates were nearly
identical to each other and to the full test passing rate. In this example,
empirical results from a large representative sample were used to adjust the
judges' ratings. Note that no decisions about examinees were based on the
part one and part two passing scores. Their only function is in estimating
the full test PF reliability. The fact that the half-test passing scores sum
to the full test passing score is due to the long length and corresponding 
high reliability of the two parts. If the two parts were shorter, this 
condition would likely be violated. 

Summary statistics for the total group and selected subgroups of
examinees are presented in Table 3. The subgroup data are presented to
illustrate the method for different sample sizes and for groups with different
observed passing rates. The determination of the subgroups is based upon
whether an examinee was taking the test for the first time or was repeating
the test, and whether an examinee graduated from an accredited or
nonaccredited university.

The sample alpha coefficients are derived from scores on the
examination's five subtests, which differ in content, rather than from the item
scores. This is why the sample alphas are slightly smaller than the sample
KR21's. Since the subtests differed widely in length, average subtest scores
(rather than total subtest scores) were used for computing the alphas.

One would generally expect that the stepped-up reliability coefficient,
$r_{SB} = 2r(Y1,Y2)/[1 + r(Y1,Y2)]$, would be larger than KR21, though that is
not always the case in Table 3. Most likely, this is due to the long length
of the test. In any case, the two reliability coefficients are very similar
in all subgroups.
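The two reliability coefficients being compared are simple to compute. The sketch below, in Python, uses the standard KR21 formula, $\frac{k}{k-1}\left[1 - \frac{M(k-M)}{k s^2}\right]$, and the Spearman-Brown formula for doubled length; the function names and the input values are hypothetical and are not taken from Table 3.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson formula 21 from test length, mean, and SD of scores."""
    variance = sd ** 2
    return (n_items / (n_items - 1.0)) * (
        1.0 - mean * (n_items - mean) / (n_items * variance))


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2.0 * r_half / (1.0 + r_half)


# Hypothetical 300-item test with mean 236 and SD 30; half-tests correlate .90.
print(round(kr21(300, 236.0, 30.0), 3))
print(round(spearman_brown(0.90), 3))
```

For a long test the two values tend to lie close together, in line with the similarity noted in the text.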

The data in Table 3 show that while Y1 and Y2 have similar standard
deviations, their means tend to differ; hence Y1 and Y2 are not precisely
parallel. Moreover, the negative skewness coefficients suggest moderate to
severe departures of the data from normal distributions.

The stepped-up phi coefficient, $\phi_{SB}$, is based, though, on neither of
these assumptions, but on the assumption that B1, B2, and A have the same pass
rates. An indication of how well the data satisfy this assumption can be
found in the pass rate column of Table 3. The passing scores for Y1 and Y2
were chosen so that the assumption would be fulfilled in the total group of
examinees. The assumption continues to be met in the group of accredited
first-time examinees but is violated to varying degrees in the remaining three
groups. As was previously discussed, the observed proportions may be smoothed
in the application of the non-distributional method. The "smoothed half-test
proportions" in Table 4 were obtained by replacing the two off-diagonal
proportions with their average. The estimated full-test proportions and PF
reliability indices in Table 4 were computed from the smoothed proportions.
Though the sample proportions in Table 4 are reported to only 3 digits, the
computations for the PF indices used 4 digits.



Insert Table 4 about here



Pass-Fail Reliability 

The HPS estimates of the full-length reliability indices are based on
Brennan's (1981) tables. Because, as was noted above, KR21 is nearly
identical to the stepped-up reliability coefficient, $r_{SB}$, the HPS reliability
indices are nearly identical to those that result from applying the normal
model to the half-test data, as discussed earlier in this paper. For this
reason, the PF reliability indices associated with the normal model are
omitted from Table 4.

Comparing the SB and HPS estimates in Table 4 shows that they yield
nearly identical estimates for $\theta$, but that their estimates for $\kappa$ are
sometimes discrepant. The results from this example are consistent with the
simulation results. When N is large and the test score distribution is
substantially skewed, as in the first two groups in Table 4, the two methods
give substantially different estimates for $\phi = \kappa$. The simulation results
indicate that the SB method estimates should be more accurate, and this is
supported in this example by the similarity of the HPS estimates to the
unstepped-up half-test MOM estimates of $\phi$. Doubling the length of the test
should increase $\phi$ at least a moderate amount; that was always the case in the
simulations. The other three groups are considerably less skewed (with fairly
normal kurtoses also), and here the HPS and SB estimates for $\phi$ are more similar
and substantially increased over the unstepped-up half-test MOM estimates
of $\phi$.

Summary and Discussion

The methods for computing PF reliability presented in this paper require
only one test administration and use the Spearman-Brown formula to obtain
stepped-up estimates of PF reliability which are computed from parallel
half-tests. They thus require that the test be divisible into two parts that
are equivalent in their content and approximately equivalent in certain
statistical characteristics. If this is not the case, then one of the
beta-binomial model based methods discussed by Subkoviak (1980) could be used,
such as the one by Huynh (1976). However, the beta-binomial method is
computationally complex and appears more appropriate when tests are short and
homogeneous in content and item difficulties. For long tests which are
heterogeneous in content and item difficulties, such as licensure
examinations, the Peng and Subkoviak (1980) approximation should yield results
nearly identical to those from the beta-binomial method. Brennan (1981)
presents tables which make the Peng and Subkoviak computations relatively
simple. Brennan (1981) also discusses other PF reliability indices in
addition to $\theta$ and $\kappa$.

If the test is divisible into parallel half-tests, then the methods
derived within possess certain advantages. Instead of KR21, which is used in
the HPS method, the normal method uses a Spearman-Brown stepped-up half-test
intercorrelation as an estimate of the correlation between two full-tests.
This latter estimate is based on less restrictive assumptions than KR21, and
as a consequence the normal method has wider applicability than the HPS
method. In particular, it should be better suited to long heterogeneous tests
such as licensure examinations, though this may not always be the case, as is
illustrated by the example given within. Of more importance, however, is the
non-distributional method, which discards distributional assumptions
altogether. The simulation results support the conclusion that when N is
large and the test score distribution is non-normal, the non-distributional
method will yield more accurate estimates than the normal method, especially
for $\phi = \kappa$ and especially for smaller (.10) failure rates.

Though the non-distributional method outperformed the normal method when
normality was violated, it still displayed mild to moderate bias. The bias,
however, was always positive, in contrast to the normal method, and this
suggests that it may be worthwhile to investigate strategies for correcting
the bias. Also, the magnitudes of the biases found here were generally similar
to those found by Peng and Subkoviak (1980) for their approximate method and
to those found by Huynh and Sanders (1980) for the beta-binomial method,
though these two studies focused on short tests. Finally, the simulation
results obtained here are only applicable when sample sizes are large. An
investigation of the behavior of the non-distributional method when sample
size is small, test length is short, and test score distributions are
non-normal could extend the applicability of the method to situations other
than the one of professional licensure and certification examinations which
was considered here.

Table 1: Characteristics of the Simulation Distributions

Dist.      Rel.    Mean    S.D.    Skew    Kurtosis

Normal*    .92     160     23.0     .01      .17
Normal*    .71      70      8.5     .00      .14
Flat       .92     [?]     25.4     .01      .93
Flat       .71      50      8.7    -.01      .65
Skewed     .92     165     20.5    -.84      .47
Skewed     .71      57      5.5    -.63      [?]

*These are nearly normal with slight platykurtosis.
Note: [?] marks an entry that is illegible or missing in the source copy.








Table 2: Parameter Estimates and Bias Estimates for the Pass-Fail Reliability
Indices (N = 20,000 and $\phi = \kappa$)

Estimates for $\theta$

                               10% Failing                   30% Failing
Dist.           Rel.  Rep.   Est.   SB bias   N bias      Est.   SB bias   N bias

Nearly Normal   .92    1     .91     .007     -.003       .89     .017      .000
                       2     .95     .007     -.001       .88     .021      .003
Nearly Normal   .71    1     .90     .001     -.006       .78     .008     -.005
                       2     .90     .001     -.001       .79     .013     -.009
Flat            .92    1     .92     .012      .028       .90     .015     -.012
                       2     .92     .009      .026       .91     .016     -.011
Flat            .71    1     .88     .008      .018       .79     .008     -.015
                       2     .87     .012      .021       .79     .003     -.011
Skewed          .92    1     .96     .009     -.007       .91     .016     -.030
                       2     .96     .010     -.008       .91     .013     -.029
Skewed          .71    1     .93     .005     -.022       .80     .013     -.011
                       2     .93     .000     -.026       .80     .012     -.037

Estimates for $\phi = \kappa$

                               10% Failing                   30% Failing
Dist.           Rel.  Rep.   Est.   SB bias   N bias      Est.   SB bias   N bias

Nearly Normal   .92    1     .70     [?]       [?]        [?]     .010      .001
                       2     .71     .029      .018       .72     .051      .008
Nearly Normal   .71    1     .11     .001      .006       .48     .026      .007
                       2     .10     .011      .016       .50     .035     -.006
Flat            .92    1     .55     .086      [?]        .77     .037     -.013
                       2     .57     .066      .111       .78     .011     -.050
Flat            .71    1     .31     .053      .118       .53     .012     -.037
                       2     .29     .088      .121       .52     .000     -.035
Skewed          .92    1     .11     .011      .089       .78     .038     -.017
                       2     .11     .019      .088       .78     .031     -.017
Skewed          .71    1     .56     .057      [?]        .55     .031     -.051
                       2     .57     .030      .156       .51     .027     -.038

Note: Est. is the MOM estimate from the two full-test scores. SB bias and
N bias are the non-distributional and normal method estimates, respectively,
minus the MOM estimate. [?] marks an entry that is illegible or missing in
the source copy.


Table 3: Summary Statistics and Reliability Coefficients, by Examinee Group

                                         Summary statistics                  Reliability coefficients
Examinee                                                         Pass     Alpha    KR21
group               N     Variable    Mean      SD    Skewness   rate      (X)      (X)    r(Y1,Y2)  $r_{SB}$

Total group        4828      X       236.06   29.78    -1.11     .900      .92      .95      .90      .91
                             Y1      119.93   15.11    -1.18     .891
                             Y2      116.13    [?]     -1.24     .897

Accredited         3999      X       241.96   22.59    -0.80     .956      .87      .91      .81      .91
first-time                   Y1      122.97   11.39    -0.88     .952
                             Y2      118.99    [?]      [?]      [?]

Accredited         [?]       X       224.01    [?]     -0.27     [?]       .90      .93      .88      .94
repeating                    Y1      113.59   11.79    -0.39     .790
                             Y2      110.42   11.81    -0.17     .821

Non-accredited     [?]       X       187.22   19.61    -0.11     .101      .96      .98      .91      .97
first-time                   Y1       95.01   21.28    -0.19     .136
                             Y2       92.21   26.06    -0.17     .136

Non-accredited     187       X       169.71   39.76    -0.03     .209      .91      .96      .90      .95
repeating                    Y1       86.05   20.38    -0.13     .198
                             Y2       83.69   20.38     0.01     .251

Note: [?] marks an entry that is illegible or missing in the source copy.

Table 4: Pass-Fail Proportions and Reliability Indices, by Examinee Group

Pass-fail proportions

                            PF decision     Half-test proportion     Full-test
Examinee group        N      B1     B2        Raw      Smoothed      proportion

Total group          4828     0      0        [?]        [?]           .089
                              0      1        [?]        [?]           .015
                              1      0        .025       .026          .015
                              1      1        .870       .870          .881

Accredited           3999     0      0        .030       [?]           .037
first-time                    0      1        .010       [?]           .012
                              1      0        .021       [?]           .012
                              1      1        .931       .931          .938

Accredited            518     0      0        .133       .133          .156
repeating                     0      1        .011       [?]           .038
                              1      0        [?]        [?]           .038
                              1      1        .715       .745          .768

Non-accredited         91     0      0        .500       .500          .527
first-time                    0      1        .061       .064          .037
                              1      0        .061       .064          .037
                              1      1        .372       .372          .399

Non-accredited        187     0      0        .722       .722          [?]
repeating                     0      1        .080       [?]           .032
                              1      0        .027       .054          .032
                              1      1        .171       .171          .193

PF reliability indices

                              Half-test                        Full-test
Examinee group           $\phi = \kappa$   $\theta$    $\phi_{SB} = \kappa_{SB}$   $\theta_{SB}$   $\phi_{HPS} = \kappa_{HPS}$   $\theta_{HPS}$

Total group                 [?]      .95         [?]          [?]          [?]           .95
Accredited first-time       .59      .96         [?]          [?]          [?]           .98
Accredited repeating        .61      .88         .76          [?]          [?]           .92
Non-accredited first-time   .71      .87         .85          .93          .90           .95
Non-accredited repeating    .69      .89         .82          [?]          .78           .92

Note: [?] marks an entry that is illegible or missing in the source copy.